Add: Worker-level chip bootstrap orchestration for distributed L3 by ChaoWao · Pull Request #613 · hw-native-sys/simpler

ChaoWao · 2026-04-21T02:37:36Z

Summary

Wires ChipWorker.bootstrap_context into the Worker factory so an L3 Worker(level>=3, chip_bootstrap_configs=[...]) brings up every chip child's communicator during init() and surfaces a ChipContext list to orch code before the first run().

Note on terminology: the L3 in the title refers to runtime hierarchy Level 3 (chip-level Worker) — see docs/hierarchical_level_runtime.md. Earlier split commits (#608, #610) used (L2)/(L5) tags as split-step labels, which collides with the Level-0..Level-6 hierarchy; this PR drops those labels from new and touched code.

ChipContext dataclass in task_interface — device_id / rank / nranks / device_ctx / local_window_base / actual_window_size / buffer_ptrs: dict[str, int]. The per-buffer dict is built by zipping cfg.buffers with the result's buffer_ptrs, so orch code addresses a named window slice without tracking list indices. A length check before the zip raises RuntimeError on a parent/child buffer-count mismatch instead of silently truncating.
Parent: per-chip ChipBootstrapChannel mailbox (4096 B shared-memory, zero-filled so state starts IDLE) allocated pre-fork. Parent polls each channel with time.sleep(0.001) + 120 s soft timeout; on the first ERROR raises RuntimeError(f"chip {idx} bootstrap failed: {channel.error_message}") and best-effort SIGKILLs every forked child + unlinks every shm so init() raises cleanly without leaking state. chip_contexts is a property that raises before init().
Child: new _chip_process_loop_with_bootstrap runs bootstrap_context first (channel publishes SUCCESS/ERROR), then enters the same task/control poll loop as _chip_process_loop. try/finally runs shutdown_bootstrap then finalize on SHUTDOWN. Bootstrap failure returns via os._exit(0) so the parent's waitpid isn't confused by a non-zero exit code layered on top of the channel's error.
Teardown ordering: _worker.close() → SHUTDOWN → waitpid → unlink sub/chip/next-level mailboxes → bootstrap mailboxes unlinked last, because chip children touch their ChipBootstrapChannel inside shutdown_bootstrap() + finalize().
The original _chip_process_loop and _Worker scheduler wiring are untouched; the bootstrap path is gated on a non-None chip_bootstrap_configs argument and runs eagerly at init() time instead of the usual lazy _start_hierarchical() on first run().
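As a rough illustration of the ChipContext item above, the by-name buffer_ptrs dict with its length guard could be sketched as below. The field names follow the summary; the helper name build_buffer_ptrs and the exact error text are assumptions made for this example, not the code in task_interface or worker.py.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ChipContext:
    """Illustrative stand-in; field names are taken from the PR summary."""
    device_id: int
    rank: int
    nranks: int
    device_ctx: int
    local_window_base: int
    actual_window_size: int
    buffer_ptrs: dict[str, int]


def build_buffer_ptrs(names: Sequence[str], ptrs: Sequence[int]) -> dict[str, int]:
    """Pair configured buffer names with the pointers reported by the child.

    zip() stops at the shorter input, so the lengths are compared first and a
    mismatch is raised instead of silently dropping entries.
    (Hypothetical helper; the message wording is an assumption.)
    """
    if len(names) != len(ptrs):
        raise RuntimeError(
            f"buffer count mismatch: parent configured {len(names)}, "
            f"child reported {len(ptrs)}"
        )
    return dict(zip(names, ptrs))


ctx = ChipContext(
    device_id=0, rank=0, nranks=2,
    device_ctx=0x1, local_window_base=0x1000, actual_window_size=4096,
    buffer_ptrs=build_buffer_ptrs(["x"], [0x1000]),
)
```

Orch code can then read ctx.buffer_ptrs["x"] rather than remembering that "x" was the first configured buffer.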

Does not extend to level-4+ recursive Worker children — the _next_level_workers fork path is unchanged; adding distributed bring-up for nested Workers is a follow-up.
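For readers who want a picture of the parent-side wait described in the Summary (1 ms poll, 120 s soft timeout, fail fast on the first ERROR), here is a minimal sketch. State, FakeChannel, wait_for_bootstrap, and the on_failure hook are hypothetical names for this example; the real ChipBootstrapChannel lives in shared memory, and the timeout message and its exception type are assumptions.

```python
import time
from enum import Enum


class State(Enum):
    IDLE = 0
    SUCCESS = 1
    ERROR = 2


class FakeChannel:
    """Hypothetical in-memory stand-in for a ChipBootstrapChannel."""

    def __init__(self, state=State.IDLE, error_message=""):
        self.state = state
        self.error_message = error_message


def wait_for_bootstrap(channels, timeout_s=120.0, poll_s=0.001,
                       on_failure=lambda: None):
    """Block until every channel reports SUCCESS.

    Raises RuntimeError on the first ERROR or when the soft timeout expires.
    on_failure is where the kill-children / unlink-shm cleanup would go.
    """
    deadline = time.monotonic() + timeout_s
    pending = set(range(len(channels)))
    try:
        while pending:
            for idx in sorted(pending):
                ch = channels[idx]
                if ch.state is State.ERROR:
                    raise RuntimeError(
                        f"chip {idx} bootstrap failed: {ch.error_message}")
                if ch.state is State.SUCCESS:
                    pending.discard(idx)
            if pending:
                if time.monotonic() > deadline:
                    # Assumed wording; the PR only states a 120 s soft timeout.
                    raise RuntimeError(
                        f"chip bootstrap timed out; pending chips: {sorted(pending)}")
                time.sleep(poll_s)
    except Exception:
        on_failure()  # best-effort cleanup before the error propagates
        raise
```

Keeping the cleanup in a single except/re-raise path means the ERROR case and the timeout case release the same resources before init() raises.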

Testing

tests/ut/py/test_worker/test_worker_distributed_sim.py — happy path + error path (bogus placement triggers RuntimeError) + chip_contexts-before-init guard + __init__ validation (level<3 reject, length-mismatch reject).
tests/ut/py/test_worker/test_worker_distributed_hw.py — 2-card hardware smoke, drives Worker(level=3, chip_bootstrap_configs=[...]) end-to-end, asserts each rank's device_ctx != 0, local_window_base != 0, actual_window_size >= requested, and buffer_ptrs == {"x": local_window_base}. No comm_barrier — HCCL 507018 stays off the critical path. Lives under tests/ut so the ut-a2a3 job picks it up without xdist's per-worker device-slicing (which would break a 2-device request under tests/st).
pytest tests/ut/py/test_worker with chip_bootstrap_configs=None paths — 59 green, no regression.
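The per-rank assertions listed for the hardware smoke test can be written as a small checker. This is a sketch only: check_chip_context is a hypothetical helper, and SimpleNamespace stands in for a real ChipContext so the snippet runs without hardware.

```python
from types import SimpleNamespace


def check_chip_context(ctx, requested_window_size):
    """Apply the four per-rank checks named in the Testing section."""
    assert ctx.device_ctx != 0, "device context pointer should be set"
    assert ctx.local_window_base != 0, "window base should be set"
    assert ctx.actual_window_size >= requested_window_size
    # With a single buffer named "x", it sits at the start of the window.
    assert ctx.buffer_ptrs == {"x": ctx.local_window_base}


# Made-up values standing in for what one rank might report.
fake_ctx = SimpleNamespace(
    device_ctx=0x7F000000,
    local_window_base=0x1000,
    actual_window_size=8192,
    buffer_ptrs={"x": 0x1000},
)
check_chip_context(fake_ctx, requested_window_size=4096)
```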

Ref: #571 (split), builds on #608, #610.

gemini-code-assist

Code Review

This pull request introduces worker-level chip bootstrap orchestration (L6). It adds the ChipContext dataclass and updates the Worker class to support asynchronous bootstrap of chip children via shared-memory mailboxes. Key changes include a new child process loop that executes bootstrap_context, a timeout-based polling mechanism in the parent to collect results, and enhanced cleanup logic to prevent shared-memory leaks on failure. New hardware and simulation tests are also provided. Feedback is provided regarding a potential silent truncation issue when zipping buffer pointers, suggesting an explicit length check to ensure data integrity.

- Add ChipContext dataclass in task_interface (device_id/rank/nranks + device_ctx, local_window_base, actual_window_size, buffer_ptrs: dict by name), exposed to L3+ orch code after a successful bring-up
- Wire Worker(level>=3, chip_bootstrap_configs=[...]) so each chip child runs ChipWorker.bootstrap_context before entering the main task / control loop; parent blocks on a per-chip ChipBootstrapChannel until every chip reports SUCCESS, assembles ChipContexts, and fails fast on the first ERROR (best-effort SIGKILL + waitpid for the rest, shms unlinked so init() raises cleanly without leaking state)
- Explicit length check before zipping cfg.buffers with the channel's buffer_ptrs, so a parent/child buffer-count disagreement raises a descriptive RuntimeError instead of silently producing a truncated buffer_ptrs dict in the ChipContext
- Bootstrap mailboxes are allocated pre-fork (SharedMemory zero-fills -> IDLE) and unlinked *after* chip pids are reaped, since chip children touch the channel inside finalize()
- Drop stale split-step labels (L2/L5/L6) from new code and from prior chip_bootstrap docstrings since they collide with the runtime Level 0-6 hierarchy documented in docs/hierarchical_level_runtime.md
- Add sim UT (happy path + error path + validation + chip_contexts-before-init guard) and hardware UT (2-card, no comm_barrier so the HCCL 507018 known-issue stays off the critical path)
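The "SharedMemory zero-fills -> IDLE" point can be checked with a few lines of standard-library Python. This sketch assumes IDLE is encoded as 0 and that the state word is the first 4 bytes of the mailbox; neither layout detail is spelled out in this PR, and the zero-fill itself relies on the OS handing back zeroed pages for a fresh segment.

```python
from multiprocessing import shared_memory

MAILBOX_BYTES = 4096  # mailbox size quoted in the PR summary
IDLE = 0              # assumption: IDLE is the all-zero state value


def fresh_mailbox_snapshot():
    """Create a mailbox-sized segment, copy its bytes out, then release it."""
    shm = shared_memory.SharedMemory(create=True, size=MAILBOX_BYTES)
    try:
        # bytes(...) copies, so no memoryview is left exported at close().
        return bytes(shm.buf[:MAILBOX_BYTES])
    finally:
        shm.close()
        shm.unlink()  # the creating process is responsible for unlinking


snapshot = fresh_mailbox_snapshot()
# Assumed layout: state word in the first 4 bytes, little-endian.
state_word = int.from_bytes(snapshot[:4], "little")
```

Because the segment is created before fork, a child that has not written anything yet is indistinguishable from IDLE, which is what lets the parent start polling immediately.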

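End-to-end run of the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613. Using Worker(level=3, chip_bootstrap_configs=...), each of the two cards sums every rank's input across ranks via CommRemotePtr (MTE2), writes the result back to its own output, and the result is read back with worker.copy_from for verification. Files:
- kernels/aiv/allreduce_kernel.cpp: carried over directly from hw-native-sys#307 (PKUZHOU / echo_stone), with a single include path changed ("common/comm_context.h" → "platform_comm/comm_context.h") to match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) through to the AIV task unchanged, bypassing the Tensor wrapper (the Tensor path would rewrite the pointers).
- main.py: 2-card harness. Per-rank input is delivered into the window during the bootstrap phase via SharedMemory + HostBufferStaging, and the shm is unlinked after init; orch_fn does add_scalar × 5 per chip and submits to submit_next_level; copy_from reads back the output for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: tagged with device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches main() automatically.
WIP: only static checks (AST parse + import-name verification) were done locally; this has not been compiled or run. The next step is to bring it up on a 2-card a2a3 environment; the known points that need verification are listed in the PR body.
Co-authored-by: echo_stone <liulei281@huawei.com>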

gemini-code-assist Bot reviewed Apr 21, 2026

Comment thread python/simpler/worker.py

ChaoWao force-pushed the feat/worker-chip-bootstrap-L6 branch 2 times, most recently from 43e439a to e70e464 Compare April 21, 2026 03:35

ChaoWao changed the title ~~Add: Worker-level chip bootstrap orchestration for distributed L3 (L6)~~ Add: Worker-level chip bootstrap orchestration for distributed L3 Apr 21, 2026

ChaoWao force-pushed the feat/worker-chip-bootstrap-L6 branch from e70e464 to 72a0f2a Compare April 21, 2026 07:58

ChaoWao merged commit fa33039 into hw-native-sys:main Apr 21, 2026
14 checks passed

ChaoWao mentioned this pull request Apr 21, 2026

feat(pr): add allreduce_distributed example, resource-based test dispatch, worker examples with co-located tests #307

Open

4 tasks

ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/worker-chip-bootstrap-L6
